Discrete Data Types

Defining Categorical Types

Rodney Dyer, PhD

 

Student Learning Objectives

At the end of this topic you should be able to:

  • Create, manage, and use factor data types.
  • Perform statistical analyses on single vectors of count data.
  • Create and analyze discrete data in a contingency table approach.

Topics Covered

In this brief presentation, we’ll be introducing the following items:

  • Discrete Data & Factor Data Types
  • The 🐈 🐈 🐈‍⬛ 🐈 library
  • Formal Binomial Tests
  • Contingency Table Analysis

Discrete Data

Categorical Data Types

 

Unique, discrete groupings that can be applied to a study design.

  • Case sensitive
  • Can be ordinal
  • Typically defined as character type
weekdays <- c("Monday","Tuesday","Wednesday",
              "Thursday","Friday","Saturday", 
              "Sunday")

 

class( weekdays )
[1] "character"

 

weekdays
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   

Making Up Data 🤷🏻‍

The function sample() allows us to take a random sample of elements from a vector of potential values.

chooseOne <- sample( c("Heads","Tails"), size=1 )
chooseOne
[1] "Tails"
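Because `sample()` draws at random, the result changes from run to run. If you need a reproducible draw, set the random seed first (the seed value here is arbitrary):

```r
# Setting the seed makes the random draw reproducible
set.seed( 42 )
sample( c("Heads","Tails"), size=1 )
```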

Making Up More Data 🤷🏻‍

However, if we want a larger number of items, we can sample with or without replacement.

sample( c("Heads","Tails"), size=10, replace=TRUE )
 [1] "Tails" "Heads" "Heads" "Tails" "Tails" "Heads" "Tails" "Tails" "Heads"
[10] "Heads"

Weekdays as example

We’ll pretend we have a bunch of data related to the day of the week.

days <- sample( weekdays, size=40, replace=TRUE)
summary( days )
   Length     Class      Mode 
       40 character character 
days
 [1] "Saturday"  "Tuesday"   "Friday"    "Wednesday" "Monday"    "Saturday" 
 [7] "Wednesday" "Tuesday"   "Friday"    "Wednesday" "Saturday"  "Saturday" 
[13] "Thursday"  "Sunday"    "Friday"    "Sunday"    "Sunday"    "Friday"   
[19] "Thursday"  "Tuesday"   "Tuesday"   "Wednesday" "Thursday"  "Saturday" 
[25] "Tuesday"   "Monday"    "Monday"    "Thursday"  "Wednesday" "Wednesday"
[31] "Friday"    "Friday"    "Saturday"  "Friday"    "Sunday"    "Tuesday"  
[37] "Tuesday"   "Thursday"  "Wednesday" "Sunday"   

Turn it into a factor

data <- factor( days )
is.factor( data )
[1] TRUE
class( data )
[1] "factor"

Data Type Specific Printing & Summaries

data
 [1] Saturday  Tuesday   Friday    Wednesday Monday    Saturday  Wednesday
 [8] Tuesday   Friday    Wednesday Saturday  Saturday  Thursday  Sunday   
[15] Friday    Sunday    Sunday    Friday    Thursday  Tuesday   Tuesday  
[22] Wednesday Thursday  Saturday  Tuesday   Monday    Monday    Thursday 
[29] Wednesday Wednesday Friday    Friday    Saturday  Friday    Sunday   
[36] Tuesday   Tuesday   Thursday  Wednesday Sunday   
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday

 

  Notice Levels.

Factor Levels

Each factor variable is defined by the levels that constitute the data. This is a finite set of unique values.

levels( data)
[1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
[7] "Wednesday"

Factor Ordination

If a factor is not ordinal, it does not allow the use of relational comparison operators.

data[1] < data[2]
Warning in Ops.factor(data[1], data[2]): '<' not meaningful for factors
[1] NA

Ordination = Ordered

is.ordered( data )
[1] FALSE

Ordination of Factors

Where ordination matters:

  • Fertilizer treatments in kg of N2 per hectare: 10 kg, 20 kg, 30 kg,

  • Days of the Week: Friday is not followed by Monday,

  • Life History Stage: seed, seedling, juvenile, adult, etc.

Where ordination is irrelevant:

  • River

  • State or Region

  • Sample Location

Making Ordered Factors

data <- factor( days, ordered = TRUE)
is.ordered( data )
[1] TRUE
data
 [1] Saturday  Tuesday   Friday    Wednesday Monday    Saturday  Wednesday
 [8] Tuesday   Friday    Wednesday Saturday  Saturday  Thursday  Sunday   
[15] Friday    Sunday    Sunday    Friday    Thursday  Tuesday   Tuesday  
[22] Wednesday Thursday  Saturday  Tuesday   Monday    Monday    Thursday 
[29] Wednesday Wednesday Friday    Friday    Saturday  Friday    Sunday   
[36] Tuesday   Tuesday   Thursday  Wednesday Sunday   
7 Levels: Friday < Monday < Saturday < Sunday < Thursday < ... < Wednesday

The problem is that the default ordering is actually alphabetical!

Specifying the Order

Specifying the Order of Ordinal Factors

data <- factor( days, ordered = TRUE, levels = weekdays)
data
 [1] Saturday  Tuesday   Friday    Wednesday Monday    Saturday  Wednesday
 [8] Tuesday   Friday    Wednesday Saturday  Saturday  Thursday  Sunday   
[15] Friday    Sunday    Sunday    Friday    Thursday  Tuesday   Tuesday  
[22] Wednesday Thursday  Saturday  Tuesday   Monday    Monday    Thursday 
[29] Wednesday Wednesday Friday    Friday    Saturday  Friday    Sunday   
[36] Tuesday   Tuesday   Thursday  Wednesday Sunday   
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday

Sorting Is Now Relevant

sort( data )
 [1] Monday    Monday    Monday    Tuesday   Tuesday   Tuesday   Tuesday  
 [8] Tuesday   Tuesday   Tuesday   Wednesday Wednesday Wednesday Wednesday
[15] Wednesday Wednesday Wednesday Thursday  Thursday  Thursday  Thursday 
[22] Thursday  Friday    Friday    Friday    Friday    Friday    Friday   
[29] Friday    Saturday  Saturday  Saturday  Saturday  Saturday  Saturday 
[36] Sunday    Sunday    Sunday    Sunday    Sunday   
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
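Now that the levels carry an order, the relational comparisons that previously produced `NA` become meaningful. A small self-contained sketch:

```r
# With ordered levels, '<' compares positions in the level ordering
wdays <- c("Monday","Tuesday","Wednesday",
           "Thursday","Friday","Saturday","Sunday")
x <- factor( c("Monday","Friday"), ordered=TRUE, levels=wdays )
x[1] < x[2]   # Monday comes before Friday, so TRUE
```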

Fixed Set of Levels

You cannot assign a value to a factor that is not one of the pre-defined levels.

data[3] <- "Bob"
Warning in `[<-.factor`(`*tmp*`, 3, value = "Bob"): invalid factor level, NA
generated
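If you really do need a new category, extend the set of levels first and then assign. A minimal sketch:

```r
# Add the new level before assigning a value outside the original set
data <- factor( c("Monday","Tuesday","Wednesday") )
levels( data ) <- c( levels(data), "Bob" )
data[3] <- "Bob"
data
```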

🐈 🐈 🐈‍⬛ 🐈
forcats

The forcats library

Part of the tidyverse group of packages.

library(forcats)

This library has a lot of helper functions that make working with factors a bit easier. I’m going to give you a few examples here but strongly encourage you to look at the cheat sheet for all the other options.

Counting Factors

A summary of levels.

fct_count( data )
# A tibble: 8 × 2
  f             n
  <ord>     <int>
1 Monday        3
2 Tuesday       7
3 Wednesday     7
4 Thursday      5
5 Friday        6
6 Saturday      6
7 Sunday        5
8 <NA>          1

Lumping Factors

Coalescing low-frequency levels into a single “Other” category.

lumped <- fct_lump_min( data, min = 5)
fct_count(  lumped )
# A tibble: 8 × 2
  f             n
  <ord>     <int>
1 Tuesday       7
2 Wednesday     7
3 Thursday      5
4 Friday        6
5 Saturday      6
6 Sunday        5
7 Other         3
8 <NA>          1

Reordering Factors

We can reorder factor levels in several ways, including order of appearance, number of observations, or the value of an associated numeric variable.

By Appearance

freq <- fct_inorder( data )
levels( freq )
[1] "Saturday"  "Tuesday"   "Wednesday" "Monday"    "Friday"    "Thursday" 
[7] "Sunday"   

By Frequency

ordered <- fct_infreq( data )
levels( ordered )
[1] "Tuesday"   "Wednesday" "Friday"    "Saturday"  "Thursday"  "Sunday"   
[7] "Monday"   

Reorder Specific Levels

This pulls the named levels to the front of the ordering, in the order you list them after the vector of data.

newWeek <- fct_relevel( data, "Saturday", "Sunday")
levels( newWeek )
[1] "Saturday"  "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday" 
[7] "Friday"   

Dropping Missing Levels

There are times when a subset of your data does not include an example of every level; here is how to drop the empty ones.

data <- sample( weekdays[1:5], size=40, replace=TRUE )
data <- factor( data, ordered=TRUE, levels = weekdays )
summary( data )
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
        8         9         7         8         8         0         0 

Dropping Missing Levels

fct_drop( data ) -> dropped
summary( dropped )
   Monday   Tuesday Wednesday  Thursday    Friday 
        8         9         7         8         8 
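Base R provides the same operation via `droplevels()`, if you prefer not to load forcats:

```r
# droplevels() removes levels with no observations
x <- factor( c("Monday","Tuesday"),
             levels = c("Monday","Tuesday","Saturday","Sunday") )
levels( droplevels( x ) )
```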

Example iris Data

Ronald Aylmer Fisher
1890 - 1962

British polymath: mathematician, statistician, geneticist, and academic. Developed foundational ideas such as:

  • Fundamental Theorem of Natural Selection,
  • The F test,
  • The Exact test,
  • Linear Discriminant Analysis,
  • Inverse probability
  • Intra-class correlations
  • Sexy son hypothesis…. 🥰

head( iris )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Operating on Factor Levels

Question: What is the mean and variance in sepal length for each of the Iris species?

The by() function allows us to perform some function on data based upon a grouping index.

by( data, index, function )

by() the ‘Classic Approach’

Here we can apply the function mean() to the data on sepal length using the species factor as a category.

by( iris$Sepal.Length, iris$Species, mean)
iris$Species: setosa
[1] 5.006
------------------------------------------------------------ 
iris$Species: versicolor
[1] 5.936
------------------------------------------------------------ 
iris$Species: virginica
[1] 6.588

by()

The same for estimating variance

by( iris$Sepal.Length, iris$Species, var)
iris$Species: setosa
[1] 0.124249
------------------------------------------------------------ 
iris$Species: versicolor
[1] 0.2664327
------------------------------------------------------------ 
iris$Species: virginica
[1] 0.4043429

group_by() the Tidy Way

For a less “dollar-sign” (`$`-laden) approach…

library( tidyverse )
iris |>
  group_by( Species) |>
  summarize( Mean = mean(Sepal.Length),
             Variance = var(Sepal.Length))
# A tibble: 3 × 3
  Species     Mean Variance
  <fct>      <dbl>    <dbl>
1 setosa      5.01    0.124
2 versicolor  5.94    0.266
3 virginica   6.59    0.404

Binomial Tests

The Binomial

We have already seen the binomial distribution that is applicable for data that has 2-categories.

\[ P(K|N,p) = \frac{N!}{K!(N-K)!}p^K(1-p)^{N-K} \]
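In R, `dbinom()` evaluates this mass function directly, and writing the formula out by hand gives the same value:

```r
# P(K = 33 | N = 50, p = 0.74) via dbinom() and via the formula itself
dbinom( x=33, size=50, prob=0.74 )
choose(50, 33) * 0.74^33 * 0.26^17   # identical value
```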

A Binomial Test

\(H_O: p = \hat{p}\)

Example

Consider the following sampling of catfish from our hypothetical examples.

fish_sample
 [1] Catfish Catfish Catfish Other   Catfish Catfish Catfish Catfish Other  
[10] Other   Catfish Catfish Other   Catfish Catfish Catfish Catfish Other  
[19] Catfish Other   Other   Other   Catfish Catfish Other   Catfish Other  
[28] Catfish Catfish Catfish Catfish Other   Catfish Other   Catfish Other  
[37] Other   Catfish Catfish Catfish Catfish Catfish Other   Other   Catfish
[46] Other   Catfish Catfish Catfish Catfish
Levels: Catfish Other
fct_count( fish_sample )
# A tibble: 2 × 2
  f           n
  <fct>   <int>
1 Catfish    33
2 Other      17

Statistical Test

We could test the hypothesis that the frequency of catfish in this sampling effort is not significantly different from our previously observed frequency of 74%.

Formally, we would specify:

\(H_O: p = 0.74\)

and test it against the alternative hypothesis

\(H_A: p \ne 0.74\)

fit <- binom.test( x = 33, n = 50, p = 0.74)

Interpretation

fit

    Exact binomial test

data:  33 and 50
number of successes = 33, number of trials = 50, p-value = 0.1991
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
 0.5123475 0.7879453
sample estimates:
probability of success 
                  0.66 

From this, we can see the following components:

  1. The data consist of 33 catfish from a sample of size 50 (double check with data to make sure you didn’t make a typo…),
  2. The probability associated with observing 33/50 if the true frequency is \(0.74\),
  3. A 95% confidence interval on the probability itself,
  4. The original frequency of success in the new data.
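Each of these components can also be pulled out of the fitted object programmatically:

```r
# Components of the binom.test() result are list elements
fit <- binom.test( x = 33, n = 50, p = 0.74 )
fit$p.value    # the p-value reported above
fit$estimate   # the observed frequency of success, 0.66
fit$conf.int   # the 95% confidence interval
```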

The Multinomial

For completeness, this approach can be extended to data with more than two categories using the multinomial distribution.

\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_K = x_K|N,p) = \frac{N!}{\prod_{i=1}^{K}x_i!}\prod_{i=1}^K p_i^{x_i} \]

We will skip this for simplicity as there are other ways we can evaluate it.
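That said, base R can evaluate the multinomial mass function directly through `dmultinom()`; a quick sketch with made-up counts and probabilities:

```r
# Probability of observing counts (3, 2, 5) across three categories
# with hypothesized probabilities (0.2, 0.3, 0.5) -- made-up numbers
dmultinom( x = c(3, 2, 5), prob = c(0.2, 0.3, 0.5) )
```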

Contingency Tables

Contingency Tables

A contingency table is a method to determine if the count of the data we see in 2 or more categories conform to expectations based upon hypothesized frequencies. Contingency tables use expectations based upon the \(\chi^2\) distribution to evaluate statistical confidence.

\[ P(x) = \frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2} \]
where \(x\) is the observed \(\chi^2\) statistic, \(k\) is the degrees of freedom, and \(\Gamma\) is the gamma function.

The Fish Data

Thinking a bit differently, when we originally specified the expectation of the number of catfish, we only estimated \(p_{catfish}\) and said \(q = 1-p =\; not \; catfish\). For contingency tables, we use these as the vector of probabilities and call them our expected frequencies.

\[ p = \left( \frac{37}{50}, \frac{13}{50} \right) = (0.74,0.26) \]

More Generally

We can use this as a general framework for 2 or more categories of observations.

\[ E = [E_1, E_2] \]

If we had \(K\) different species of fishes, it could similarly be expanded to something like:

\[ E = [E_1, E_2, \ldots, E_K] \]

Expectations as Counts

This can be used to determine the expected number of observations for any arbitrary sample size. For example, if you sampled 172 fishes, you would expect the following counts.

p <- c( 37/50, 13/50)
p * 172 -> Expected
Expected
[1] 127.28  44.72

These can then be compared against the corresponding vector of observed counts,

\[ O = [O_1, O_2] \]

A Test Statistic

The test statistic here, \(T\), is defined as the standardized distance between observed and expected observations.

\[ T = \sum_{i=1}^c \frac{(O_i - E_i)^2}{E_i} \]

 

Which is distributed (with some assumptions) as a \(\chi^2\) random variable.
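The statistic is easy to compute by hand; here is a sketch using hypothetical observed counts of 108 and 64 against the catfish expectations:

```r
# Hand computation of the chi-squared statistic T
O <- c( 108, 64 )                  # hypothetical observed counts
E <- sum(O) * c( 0.74, 0.26 )      # expected counts under H0
T_stat <- sum( (O - E)^2 / E )
T_stat
```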

Formal Testing

\(H_O: O = E\)

In R, we can test this using the chisq.test() function.

chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE,
           simulate.p.value = FALSE, B = 2000)

The salient points are:

  1. The value of \(x\), which is our observed data as a vector of counts.
  2. The expected frequencies (the \(p_i\) values). If you look at the function above, if you do not specify values for \(p\) it will assume the categories are expected in equal proportions (e.g., \(p_1 = p_2 = \ldots = p_k\)). This is not what we want, so we’ll have to supply those expected proportions to the function directly.

Running the Test

Fishes <- c(108,64)
p <- c(0.74, 0.26)
fit <- chisq.test(x=Fishes, p = p )

The function returns the result as a ‘list-like’ object which we can look at by printing it out.

fit

    Chi-squared test for given probabilities

data:  Fishes
X-squared = 11.233, df = 1, p-value = 0.0008037

Two Categories of Factors

We can expand this approach to two or more categorical data types. Consider the following data set in R for hair and eye colors collected in 1974 at the University of Delaware (n.b., this is a 3-dimensional matrix of observations):

HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

Simplification

For simplicity, I’m going to combine the 3rd dimension to produce a single 4x4 matrix of observations.

hairAndEye <- HairEyeColor[,,1] + HairEyeColor[,,2]
hairAndEye
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16

Assumptions:
1. Each row is a random sample relative to the column.
2. Each column is a random sample relative to the row.

Consequences

hairAndEye[1,] -> x
x
Brown  Blue Hazel Green 
   68    20    15     5 

Is a random sample whose proportions

x / sum(x)
    Brown      Blue     Hazel     Green 
0.6296296 0.1851852 0.1388889 0.0462963 

are one sample from the larger set of data (remaining columns) whose proportions are:

colSums( hairAndEye ) -> tot
tot / sum(tot)
    Brown      Blue     Hazel     Green 
0.3716216 0.3631757 0.1570946 0.1081081 

Visual Description

Observations

The raw data consists of a matrix of values.

\[ \begin{bmatrix} O_{11} & O_{12} & \ldots & O_{1c} \\ O_{21} & O_{22} & \ldots & O_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ O_{r1} & O_{r2} & \ldots & O_{rc} \end{bmatrix} \]

Row Marginals

The factor represented by the row variable can be summed across all columns.

\[ R = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{r} \end{bmatrix} \]

Column Marginals

And the data represented by the columns can be summed as:

\[ C = \begin{bmatrix} C_{1} & C_{2} & \ldots & C_{c} \end{bmatrix} \]

And the total number of observations, \(N\), is given by

\[ N = \sum_{i=1}^r R_i \]

or

\[ N = \sum_{j=1}^c{C_j} \]

The Null Hypothesis

For contingency tables, we are assuming, under the null hypothesis, that the variables represented in counts of rows and columns are independent of each other.

\(H_O:\) The event ‘an observation in row i’ is independent of the event ‘the same observation in column j’ for all i and j.

 

Or if we want to shorten it

\(H_O:\) Hair and eye colors are independent traits.

Deriving the Expectations

The expected values for each row (i) and column (j) are estimated as:

\[ E_{ij} = \frac{R_i \times C_j}{N} \]
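In R, the whole matrix of expectations can be produced in one step with `outer()`; a small sketch on a made-up 2x2 table:

```r
# E[i,j] = R_i * C_j / N for every cell at once
O <- matrix( c(10, 20, 30, 40), nrow=2 )
E <- outer( rowSums(O), colSums(O) ) / sum(O)
E
```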

The Test Statistic

Similar to the one-row example above, the test statistic for larger tables is the sum of standardized differences between observed and expected values across all cells.

\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

 

Which is evaluated against the \(\chi^2\) distribution to ascertain statistical confidence.

Evaluating in R

In R, it is slightly simplified as:

fit <-  chisq.test(hairAndEye)

Which, as normal, returns a list-like response with all the data we need.

Interpretation

It is a bit less verbose than other examples but has the needed information.

fit

    Pearson's Chi-squared test

data:  hairAndEye
X-squared = 138.29, df = 9, p-value < 2.2e-16

 

names(fit)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed" 
[7] "expected"  "residuals" "stdres"   
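Those components are accessed with the `$` operator; for example, the expected counts and standardized residuals:

```r
# The fitted object stores expected counts and residuals
fit <- chisq.test( HairEyeColor[,,1] + HairEyeColor[,,2] )
round( fit$expected, 1 )   # expected counts under independence
round( fit$stdres, 2 )     # standardized residuals per cell
```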

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.

(Image: Peter Sellers looking bored)